Every "cut your LLM costs" tutorial says the same thing: send fewer input tokens. Trim the system prompt, summarise the history, shrink the context. It feels obvious, and it is mostly wrong. Input-token count is the weakest lever on the board, and optimising it first is how teams spend a week saving 8% while the real cost sits untouched.
The bill is set by a mix of three token types priced very differently. Once you see the mix, a counterintuitive move follows: sending more input tokens, in the right shape, routinely lowers the total cost of a request. Here is the mechanism, with the retrieval pipeline I use to exploit it.
1. Three token types, three prices
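A single LLM call is not billed at one rate. There are at least three, and they are not close to each other. Using the relative rates this article works with throughout:

Token type        Relative price
Fresh input       1x (base rate)
Cached input      about 0.1x
Output            about 5x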
Two consequences fall straight out of this table. First, one output token can cost as much as thirty to fifty cached input tokens. If you are going to optimise anything, optimise the length of what the model writes, not what it reads. Second, input tokens are not a fixed price: a token you can serve from cache is almost free. "Input tokens" is not one number, it is two, and the gap between them is an order of magnitude.
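In relative units, the cost of one request is roughly (fresh_in x 1) + (cached_in x 0.1) + (out x 5). Minimising raw token count works on the input terms, which carry the smallest coefficients, and leaves the output term untouched.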
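To make the coefficients concrete, here is a minimal sketch of that formula in Python. The weights are the relative rates used in this article, and the 4,000-in / 800-out request is a hypothetical example, not a measurement. It compares three ways of "saving" on the same request.

```python
# Relative price weights from the formula above: fresh input = 1,
# cached input = 0.1, output = 5. Real provider ratios vary.
WEIGHTS = {"fresh_in": 1.0, "cached_in": 0.1, "out": 5.0}


def relative_cost(fresh_in: int, cached_in: int, out: int) -> float:
    """Cost of one request, in units of 'one fresh input token'."""
    return (
        fresh_in * WEIGHTS["fresh_in"]
        + cached_in * WEIGHTS["cached_in"]
        + out * WEIGHTS["out"]
    )


# Hypothetical request: 4,000 fresh input tokens, no cache hits, 800 output tokens.
baseline = relative_cost(4000, 0, 800)

# Three different optimisations applied to that same request.
cut_input = relative_cost(3600, 0, 800)      # trim 400 input tokens
cut_output = relative_cost(4000, 0, 400)     # trim 400 output tokens
cache_prefix = relative_cost(0, 4000, 800)   # serve all input from cache

for name, cost in [("cut input", cut_input),
                   ("cut output", cut_output),
                   ("cache prefix", cache_prefix)]:
    print(f"{name:<13} saves {baseline - cost:>6.0f} units of {baseline:.0f}")
```

Under these assumed weights, trimming 400 output tokens saves five times what trimming 400 input tokens does, and caching the input saves more than either.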
2. Where the money actually leaks: output tokens
Here is the part most cost guides miss. The shape of your input controls the length of your output. Feed a model a messy, unranked context dump and it does not just answer. It orients itself out loud.
You have seen the symptom in the generated text: "Based on the provided documents, it appears that several sources discuss... Document 3 seems most relevant, although Document 1 also touches on... Let me synthesise these..." That is the model spending expensive output tokens narrating its own search through context you handed it in a bad order. Noisy, contradictory, or unranked chunks make this worse, because the model hedges, restates, and re-derives instead of answering.
Naive RAG, which embeds documents and stuffs the top-k cosine matches into the prompt, is a machine for producing exactly this. The top cosine matches are not the most useful chunks, only the most superficially similar, so the model gets a pile it has to sort through at 5x the input price.
3. The quieter leak: cache misses
Prompt caching only fires on a stable prefix. The provider hashes the leading tokens of your request and reuses the computed attention state if the next request starts identically. Break the prefix by a single byte and you pay full freight again.
This is where people misread caching, so let me be precise about what it does and does not buy you in RAG:
- Retrieved chunks are not cacheable across different queries. Each question retrieves different context, so that block is fresh every time. Anyone promising "cache your whole RAG context" is selling something.
- Your static scaffold absolutely is cacheable, but only if it is byte-stable. System prompt, tool definitions, formatting instructions, and few-shot examples should be deterministic and placed first. Inject a timestamp, a random request ID, or reorder your tool list per call, and you bust the cache on the one part that could have been free.
- Inside an agent loop, the retrieved context is reused across turns. A reason-then-act agent calls the model several times over the same retrieved block. If that block is emitted deterministically, every turn after the first reads it from cache. A pipeline that re-sorts or re-formats context between turns pays the base rate on every turn instead.
So caching rewards determinism. A retrieval stage that emits the same ranked payload the same way every time is cache-friendly. One that returns chunks in nondeterministic order quietly pays the miss penalty on that block across every agent turn, and a scaffold that changes per call pays it on the part that should have been free.
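One way to keep the scaffold byte-stable is sketched below. Everything here is an illustrative assumption: the prompt text, the tool schema, and the function names are hypothetical, and the fingerprint is only a local debugging aid, not the cache key any provider actually computes.

```python
import hashlib
import json

# Hypothetical static system prompt; it must never contain per-request values.
SYSTEM_PROMPT = "You answer questions using only the numbered context."


def build_scaffold(tools: list[dict]) -> str:
    """Serialise the static part of the prompt so equal inputs give equal bytes."""
    # Sort tools by name and serialise with sorted keys, so list or dict
    # ordering in the calling code cannot change the output.
    ordered = sorted(tools, key=lambda t: t["name"])
    tools_json = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return f"{SYSTEM_PROMPT}\n\nTOOLS:\n{tools_json}\n"


def build_prompt(tools: list[dict], context_block: str,
                 question: str, request_id: str) -> str:
    """Stable content first; per-request values last so they cannot break the prefix."""
    return (
        build_scaffold(tools)
        + context_block
        + f"\nQUESTION: {question}\n"
        + f"REQUEST_ID: {request_id}\n"
    )


def prefix_fingerprint(text: str) -> str:
    """Short hash for spotting accidental prefix changes in logs or tests."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


tools_a = [{"name": "search", "args": ["q"]}, {"name": "calc", "args": ["expr"]}]
tools_b = list(reversed(tools_a))  # same tools, different order

same_prefix = (prefix_fingerprint(build_scaffold(tools_a))
               == prefix_fingerprint(build_scaffold(tools_b)))
print(same_prefix)  # True: tool order no longer changes the bytes
```

Asserting on the fingerprint in a unit test is a cheap way to catch a stray timestamp or reordered tool list before it reaches production.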
4. The fix: spend input tokens to buy back output tokens
This is the inversion. Instead of minimising the context, you invest in making it structured, ranked, and deterministic, then let that slightly larger but cleaner payload collapse the output and stabilise the prefix. The retrieval pipeline that does it has three stages.
First, hybrid retrieval: run dense vector search and BM25 keyword search in parallel, then fuse them with Reciprocal Rank Fusion. RRF needs no score normalisation because only rank positions matter, so a chunk that both methods rank highly floats to the top. This is the actual fusion code from my engine:
    @staticmethod
    def _rrf(vec_hits, bm25_hits, rrf_k=60):
        """Fuse two ranked lists via RRF. A doc in both lists scores higher."""
        def key(doc):
            return (doc["file"], doc["chunk"])

        # Each list contributes 1 / (rrf_k + rank), with rank counted from 1.
        rrf_scores = {}
        for rank, doc in enumerate(vec_hits):
            k = key(doc)
            rrf_scores[k] = rrf_scores.get(k, 0.0) + 1.0 / (rrf_k + rank + 1)
        for rank, doc in enumerate(bm25_hits):
            k = key(doc)
            rrf_scores[k] = rrf_scores.get(k, 0.0) + 1.0 / (rrf_k + rank + 1)

        # Keep the first copy of each chunk, then sort by fused score.
        seen = {}
        for doc in vec_hits + bm25_hits:
            seen.setdefault(key(doc), doc)
        return sorted(seen.values(), key=lambda d: rrf_scores.get(key(d), 0.0), reverse=True)
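To see the fusion behave, here is a small standalone sketch of the same scoring rule run on toy data. The function is restated as a plain function so the example runs on its own, and the file names and rankings are made up for illustration.

```python
def rrf(vec_hits, bm25_hits, rrf_k=60):
    """Standalone restatement of the fusion above, for demonstration only."""
    def key(doc):
        return (doc["file"], doc["chunk"])

    scores = {}
    for hits in (vec_hits, bm25_hits):
        for rank, doc in enumerate(hits, start=1):
            scores[key(doc)] = scores.get(key(doc), 0.0) + 1.0 / (rrf_k + rank)

    seen = {}
    for doc in vec_hits + bm25_hits:
        seen.setdefault(key(doc), doc)
    return sorted(seen.values(), key=lambda d: scores[key(d)], reverse=True)


# Toy rankings (hypothetical): each list is ordered best-first.
vec_hits = [
    {"file": "a.md", "chunk": 0},
    {"file": "b.md", "chunk": 2},
    {"file": "c.md", "chunk": 1},
]
bm25_hits = [
    {"file": "b.md", "chunk": 2},
    {"file": "d.md", "chunk": 0},
    {"file": "a.md", "chunk": 0},
]

fused = [(d["file"], d["chunk"]) for d in rrf(vec_hits, bm25_hits)]
print(fused)
# [('b.md', 2), ('a.md', 0), ('d.md', 0), ('c.md', 1)]
```

The chunk from b.md wins because it ranks second and first (1/62 + 1/61), which beats a.md at first and third (1/61 + 1/63). Chunks that appear in only one list fall below both.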
Second, cross-encoder reranking: take the fused top-50 candidate pool and rescore it with a cross-encoder that reads the query and each passage together. A bi-encoder embeds query and passage separately and compares vectors; a cross-encoder judges them jointly and is far more accurate about actual relevance. You run it only on the 50-candidate pool, so the latency is bounded, and you keep the top 5.
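The rerank step can be sketched independently of any particular model. In the sketch below, `score_fn` is a hypothetical hook that a real pipeline would back with a cross-encoder; the `toy_overlap_score` function is only a stand-in so the example runs without a model, and the `"text"` field on each chunk is an assumption about the chunk format.

```python
from typing import Callable


def rerank(query: str, candidates: list[dict],
           score_fn: Callable[[str, str], float],
           pool_size: int = 50, keep: int = 5) -> list[dict]:
    """Rescore the top `pool_size` fused candidates and keep the best `keep`."""
    pool = candidates[:pool_size]  # bounds the number of scoring calls
    scored = [(score_fn(query, doc["text"]), idx, doc)
              for idx, doc in enumerate(pool)]
    # Sort by score descending; break ties by the original fused rank so
    # the same inputs always produce the same order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc for _, _, doc in scored[:keep]]


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: share of query words found in the passage.

    This is NOT a cross-encoder; it only makes the example runnable.
    """
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


docs = [
    {"file": "a.md", "chunk": 0, "text": "billing overview and invoices"},
    {"file": "b.md", "chunk": 1, "text": "prompt caching lowers input cost"},
    {"file": "c.md", "chunk": 0, "text": "caching"},
]

top = rerank("prompt caching cost", docs, toy_overlap_score, keep=2)
print([d["file"] for d in top])  # ['b.md', 'c.md']
```

The tie-break on original rank matters for the caching argument: two chunks with equal scores must not swap places between calls.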
Third, deterministic formatting: emit those 5 chunks in rank order, numbered, in a fixed template, every time. Same query shape, same bytes.
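A fixed template can be as simple as the sketch below. The header text, the `source=` label, and the `"text"` field are illustrative assumptions, not the engine's actual format; the point is only that the output depends on nothing but the ranked chunks.

```python
def format_context(chunks: list[dict]) -> str:
    """Render ranked chunks in a fixed, numbered template."""
    lines = ["CONTEXT (ranked, most relevant first):"]
    for i, doc in enumerate(chunks, start=1):
        # Collapse whitespace so incidental spacing differences in the
        # stored chunk do not change the emitted bytes.
        text = " ".join(doc["text"].split())
        lines.append(f"[{i}] source={doc['file']}#chunk{doc['chunk']}")
        lines.append(text)
    return "\n".join(lines) + "\n"


ranked = [
    {"file": "b.md", "chunk": 1, "text": "prompt caching  lowers\ninput cost"},
    {"file": "c.md", "chunk": 0, "text": "caching"},
]

block = format_context(ranked)
print(block)
```

No timestamps, no scores, and no dict iteration order leak into the block, so the same ranked list yields the same string on every agent turn.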
The payload that reaches the model might be 20% larger than a naive top-5 cosine dump, because reranking lets you confidently include a couple more high-value chunks. But it is clean and ordered. The model stops narrating its orientation and answers directly. Output tokens fall, and the deterministic block caches across agent turns.
5. The math, worked
Take a representative 2026 pricing point of $3 per million input tokens, $15 per million output (5x), and cached reads at $0.30 per million (0.1x). Compare one query under each approach. These numbers are illustrative, not a benchmark of your workload, but the direction is the point.
NAIVE: 4,000 fresh input + 800 output (model hedges through unranked chunks)
    = 4,000 * $3/M + 800 * $15/M
    = $0.0120 + $0.0120
    = $0.0240

STRUCTURED: 4,800 input (+20%), but 4,000 is a stable cached prefix,
            and clean ranking cuts output to 350 tokens
    = (4,000 cached * $0.30/M) + (800 fresh * $3/M) + (350 out * $15/M)
    = $0.0012 + $0.0024 + $0.00525
    = $0.00885

Result: ~63% cheaper, while sending MORE input tokens.
The structured request processes more input and still costs a little over a third as much (about 37%), because it moved spend off the two expensive coefficients (output and fresh input) and onto the cheap one (cached input). That is the whole trick. You did not minimise tokens. You re-balanced the mix.
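The arithmetic above is easy to reproduce and then rerun with your own numbers. This sketch uses the illustrative prices and token counts from this section; swap in your provider's rates and your measured token counts before drawing conclusions.

```python
# Illustrative prices from this section, in dollars per million tokens.
PRICE_PER_M = {"fresh_in": 3.00, "cached_in": 0.30, "out": 15.00}


def request_cost(fresh_in: int = 0, cached_in: int = 0, out: int = 0) -> float:
    """Dollar cost of one request under the prices above."""
    total = (
        fresh_in * PRICE_PER_M["fresh_in"]
        + cached_in * PRICE_PER_M["cached_in"]
        + out * PRICE_PER_M["out"]
    )
    return total / 1_000_000


naive = request_cost(fresh_in=4000, out=800)
structured = request_cost(fresh_in=800, cached_in=4000, out=350)
saving = 1 - structured / naive

print(f"naive=${naive:.5f} structured=${structured:.5f} saving={saving:.0%}")
# naive=$0.02400 structured=$0.00885 saving=63%
```

Note that this treats the cached prefix as already warm. The first request that populates the cache pays at least the fresh rate for those tokens, so the saving applies from the second call onward.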
6. When this does not apply
Honesty about the boundaries, because this is not a universal law:
- Single-shot, high-output tasks (draft a 2,000-word article) are output-dominated no matter what. Retrieval shape barely moves the bill; pick a cheaper generation model instead.
- Tiny contexts (a 300-token prompt) are below the level where caching and ranking matter. Do not build a reranking pipeline to save fractions of a cent.
- Providers without prompt caching, or with short cache TTLs that expire before your next call, lose the caching half of the argument. The output-reduction half still holds.
- The reranker is not free. A cross-encoder adds 200-400ms and, if hosted, its own compute cost. On high-volume, latency-critical paths, weigh that against the token savings.
What I Built
The pipeline above is the RAG Knowledge Engine: hybrid BM25 + vector retrieval fused with RRF, a cross-encoder reranker over the top-50 pool, deterministic context formatting, and a RAGAS-style evaluator. It has 25 tests, all mocked, so the suite runs with no live services. The same retrieval shape powers the grounded chatbot in the corner of this site. The cost behaviour described here is the reason it is built this way rather than as a naive top-k dump.
If you take one thing from this: stop counting tokens, and start pricing the mix.